Installation of the required packages and loading of their libraries
The dataset to be analyzed consists of white wine samples from different variants of the Portuguese “Vinho Verde”. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
I will start by looking at the distributions in the white wine dataset. To do so, I will plot histograms of the different variables in the file.
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The file contains data from 4898 white wines: 11 numerical physicochemical parameters, one quality score, and one additional column (X) with the row number. Several features have outliers far above the 3rd quartile of their distributions (e.g. fixed acidity, volatile acidity, residual sugar, total sulfur dioxide). I will create a discrete value by transforming the quality score, and I will include a new variable that rates the wines as bad (<5), average (5
We can observe that most of the white wines in the list are considered “average quality”. Next I will explore the data by creating histograms for each of the variables (continuous data). To see them better I will group them together.
As the distributions are skewed, because most of the variables have outliers, I will create two histograms as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The fixed acidity histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The volatile acidity histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The citric acid histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar histogram had to be plotted on a log10 scale; the mean and quartiles are shown above. It looks like there are two distributions for residual sugar.
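As a minimal sketch of the log10 transformation described above, the snippet below uses only the first ten residual sugar values printed in the `str()` output earlier in this report (not the full dataset), and base R `hist()` rather than whichever plotting function produced the original figure. With ggplot2, `scale_x_log10()` would apply the same transformation to the axis instead of the data.

```r
# First ten residual.sugar values, copied from the str() output above
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# A log10 transform compresses the long right tail of a skewed variable
log_sugar <- log10(residual_sugar)

# Histogram of the transformed values (base R graphics)
hist(log_sugar,
     breaks = 10,
     main = "Residual sugar (log10 scale)",
     xlab = "log10(residual sugar, g/dm^3)")

# Right skew shows up as mean > median on the raw scale
c(mean = mean(residual_sugar), median = median(residual_sugar))
```

On the full data the same pattern holds: the summary above reports a mean (6.391) well above the median (5.200), which is what motivates the log scale.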
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The chlorides histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The free sulfur dioxide histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The total sulfur dioxide histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The sulphates histogram looks approximately normal; the mean and quartiles are shown above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
All three plot styles were used for the alcohol variable, including the log10 scale; the mean and quartiles are shown above.
The dataset is composed of 4898 records of white wine. For each one we have data on 12 different characteristics or features, of which one (quality) is a discrete, categorical variable. From this variable I have created a new one, classifying the wines into 3 categories according to their rating. The remaining variables are physical and chemical properties, e.g. alcohol percentage, pH, acidity, density, etc.
Quality is my main feature of interest and the one by which the consumer judges a wine; on the other hand, the perception of a wine's quality is closely linked to its properties. As taste is one of the factors to take into account, I will also look at residual sugar and alcohol percentage.
I will investigate the relationship between quality and the main physical/chemical characteristics (acidity, sugar content, pH, alcohol). Density could be related to the alcohol content.
I have created one new categorical variable, called rate, to classify the wines into categories according to the quality value of each record.
If we look at the shapes of the histograms, we see that all have similar distributions except residual sugar and alcohol. The scale is wider than usual because of points outside the boxplot whiskers (outliers).
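One way to build such a rate variable is with base R `cut()`. This is a sketch, not the original code: the data frame name `toy` is hypothetical, and the cut point for "excellent" (quality of 7 or more) is an assumption. It is chosen because it is consistent with the 1243-row bad/excellent subset mentioned later in this report; the text itself only states bad (<5) explicitly.

```r
# Hypothetical toy data: one wine per quality score from 3 to 9
toy <- data.frame(quality = 3:9)

# Bin quality into three ordered categories.
# Intervals are right-closed: (-Inf, 4], (4, 6], (6, Inf)
# ASSUMPTION: "excellent" starts at quality 7
toy$rate <- cut(toy$quality,
                breaks = c(-Inf, 4, 6, Inf),
                labels = c("bad", "average", "excellent"),
                ordered_result = TRUE)

print(toy)
table(toy$rate)
```

Because `rate` is derived directly from `quality`, the two variables carry the same information at different resolutions.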
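The boxplot outlier rule can be checked directly from the summary statistics printed above. The sketch below assumes the usual 1.5 x IQR whisker convention (the default in R boxplots) and uses only the quartiles and maxima already reported; the name `stats_tbl` is illustrative.

```r
# Quartiles and maxima copied from the summary() outputs above
stats_tbl <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity",
               "residual.sugar", "total.sulfur.dioxide"),
  q1  = c(6.3, 0.21, 1.7, 108),
  q3  = c(7.3, 0.32, 9.9, 167),
  max = c(14.2, 1.1, 65.8, 440)
)

# Upper whisker limit: Q3 + 1.5 * IQR
stats_tbl$iqr         <- stats_tbl$q3 - stats_tbl$q1
stats_tbl$upper_fence <- stats_tbl$q3 + 1.5 * stats_tbl$iqr

# A maximum above the fence means at least one high outlier
stats_tbl$has_high_outlier <- stats_tbl$max > stats_tbl$upper_fence

print(stats_tbl)
```

For fixed acidity, for example, the fence is 7.3 + 1.5 x 1.0 = 8.8, far below the maximum of 14.2, which is why the histogram axis stretches so far to the right.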
First we install and load the libraries needed to analyze the relationships between variables.
I will create a panel to analyze the relationships between the different variables.
When plotting the panels in the console and examining them at a higher resolution in the Plots pane, we can observe that the algorithm works very well, as it shows a high correlation (0.85) between quality and rate, which carry the same information (as you may remember, rate was created from quality). I have to try another plotting function, as the correlations do not stand out visually. Therefore I will use corrplot, which shows the correlations in red and blue and with a larger font.
Now I can easily identify the strongest correlations (absolute value greater than 0.45). The positive ones are residual.sugar and density, free.sulfur.dioxide and density, and total.sulfur.dioxide and density. The negative ones are density and alcohol, total.sulfur.dioxide and alcohol, and residual.sugar and alcohol. Therefore, from now on my parameters of interest are: residual.sugar, alcohol, density, free.sulfur.dioxide and total.sulfur.dioxide.
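Reading pairs off a correlation plot by eye can be backed up with a small helper that lists every pair whose absolute Pearson correlation exceeds a threshold. The sketch below is illustrative only: `strong_pairs()` is a hypothetical helper, and the data are simulated (not the wine data), with a dependence structure chosen so that density rises with sugar and falls with alcohol.

```r
# SIMULATED data for illustration only (not the wine dataset)
set.seed(1)
n <- 500
sugar   <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.994 + 0.0004 * sugar - 0.002 * (alcohol - 10.5) +
  rnorm(n, sd = 0.0003)
sim <- data.frame(residual.sugar = sugar, alcohol = alcohol,
                  density = density, pH = rnorm(n, 3.2, 0.15))

# Hypothetical helper: list variable pairs with |r| above a threshold
strong_pairs <- function(df, threshold = 0.45) {
  m <- cor(df, method = "pearson")
  # Upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(m) & abs(m) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(m)[idx[, "row"]],
             var2 = colnames(m)[idx[, "col"]],
             r    = m[idx],
             row.names = NULL)
}

strong <- strong_pairs(sim, threshold = 0.45)
print(strong)
```

Applied to the numeric columns of the wine data frame, the same helper would return the pairs above the 0.45 cut-off together with their signs.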
The boxplots show the relationships between quality and the variables of interest.
We can observe that the lower the residual sugar, the higher the evaluated quality, and the higher the alcohol concentration, the greater the observed quality. For sulfur dioxide the levels sit in the middle, as the correlation is not as strong as for the other factors. The lower the density, the higher the evaluated quality.
## # A tibble: 6 × 4
## quality alcohol_mean alcohol_median n
## <ord> <dbl> <dbl> <int>
## 1 3 10.34500 10.45 20
## 2 4 10.15245 10.10 163
## 3 5 9.80884 9.50 1457
## 4 6 10.57537 10.50 2198
## 5 7 11.36794 11.40 880
## 6 8 11.63600 12.00 175
A new data frame has been created for alcohol and quality.
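A summary table of this shape (one row per quality level, with the mean, median and count) is typically produced with dplyr's `group_by()` and `summarise()`. The sketch below assumes dplyr is installed and uses a small made-up data frame `toy`, not the wine data; only the column names mirror the tibble above.

```r
library(dplyr)

# Made-up illustrative data (not the wine dataset)
toy <- data.frame(
  quality = factor(c(5, 5, 6, 6, 6, 7), ordered = TRUE),
  alcohol = c(9.4, 9.8, 10.2, 10.6, 11.0, 11.8)
)

# One row per quality level: mean, median and number of wines
alcohol_by_quality <- toy %>%
  group_by(quality) %>%
  summarise(
    alcohol_mean   = mean(alcohol),
    alcohol_median = median(alcohol),
    n              = n()
  )

print(alcohol_by_quality)
```

The `n` column matters when reading the real table: the extreme quality levels have far fewer wines than the middle ones, so their means are less stable.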
The boxplot shows how alcohol varies with quality.
## geom_smooth: na.rm = FALSE
## stat_smooth: na.rm = FALSE, method = lm, formula = y ~ x, se = TRUE
## position_identity
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Above we have displayed all the metrics for the strongest negative correlation (alcohol and density), including the confidence interval.
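The numbers in this output are linked by standard formulas, so they can be reproduced from the correlation and the degrees of freedom alone. The sketch below recomputes the t statistic and the 95 percent confidence interval (via the Fisher z transformation, which is the method `cor.test()` documents for Pearson correlations) from the values printed above; the variable names are illustrative.

```r
r  <- -0.7801376   # sample correlation reported above
df <- 4896         # degrees of freedom reported above (n - 2)
n  <- df + 2       # number of wines

# t statistic for H0: true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 7)
```

With almost 4900 observations the interval is very narrow, so the estimate of about -0.78 is precise even though it says nothing about causation.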
## geom_smooth: na.rm = FALSE
## stat_smooth: na.rm = FALSE, method = lm, formula = y ~ x, se = TRUE
## position_identity
##
## Pearson's product-moment correlation
##
## data: density and residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
The other strong correlation is between residual sugar and density.
The scatter plots confirm the hypothesis made before, but we can observe that for residual sugar and density we have only a small sample tested. This can lead to errors when making assumptions and drawing conclusions.
Plot of the new data frame with just 1243 observations instead of 4898, but maintaining the same 14 variables. Bivariate plots of the 3 parameters that seem to have the most influence on quality: residual.sugar, alcohol, density.
According to the correlation results and the graphs given by pairs.panels and corrplot, the parameters that have the highest correlation with quality are: residual.sugar, alcohol, density, free.sulfur.dioxide and total.sulfur.dioxide. It is somehow understandable that the consumer appreciates a wine with more alcohol. The surprise for me was the density of the wine: the lower it is, the higher the quality.
Therefore it seems that a high-quality white wine is high in alcohol, not sweet and not dense. Bear in mind that, as we haven’t conducted the experiment ourselves, we can’t infer that correlation implies causation. I am drawing my conclusion on the assumption that the factors were selected in a controlled experiment.
As observed in the correlation matrix, in general we can see that sulfur dioxide has an influence. Keeping it around the median value leads to a higher quality grade.
The strongest relationship (see pairs.panels) is between the variables rate and quality, because we created the former from the values of the latter. Excluding this, the next highest correlation is the positive correlation between residual sugar and density, as can be seen in the corrplot graph and as I double-checked with the Pearson correlation value (0.839).
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality_lm
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
##
## Pearson's product-moment correlation
##
## data: alcohol and quality_lm
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
##
## Pearson's product-moment correlation
##
## data: density and quality_lm
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
With this selection of data we are no longer able to detect a strong correlation. This was there from the beginning, if I examine the pairs.panels correlation plot again. I now see clearly that between all the parameters and the output (quality or rate) there is no correlation above 0.5.
What we can observe is correlation among the factors, but not with the evaluated quality.
Even so, I want to create 2D visualizations to see where the excellent and bad wines cluster in terms of the 3 parameters. I will create visualizations with the different combinations of the 3 characteristics.
The density plot shows that the combination that best separates wine quality is alcohol and density.
The graphics show no clear separation between the different quality buckets for the 3 variables. Alcohol can be a good candidate to investigate further because of a partial separation.
With these patterns I will create linear models to see whether I can relate quality to those 3 features.
##
## Calls:
## m1: lm(formula = I(quality_lm ~ residual.sugar), data = rw)
## m2: lm(formula = quality_lm ~ residual.sugar + alcohol, data = rw)
## m3: lm(formula = quality_lm ~ residual.sugar + alcohol + density,
## data = rw)
##
## ====================================================
## m1 m2 m3
## ----------------------------------------------------
## (Intercept) 5.987*** 2.021*** 90.313***
## (0.020) (0.117) (12.374)
## residual.sugar -0.017*** 0.022*** 0.053***
## (0.002) (0.002) (0.005)
## alcohol 0.354*** 0.246***
## (0.010) (0.018)
## density -87.886***
## (12.317)
## ----------------------------------------------------
## R-squared 0.0 0.2 0.2
## adj. R-squared 0.0 0.2 0.2
## sigma 0.9 0.8 0.8
## F 47.1 619.4 434.1
## p 0.0 0.0 0.0
## Log-likelihood -6331.2 -5802.2 -5776.8
## Deviance 3804.4 3065.3 3033.7
## AIC 12668.4 11612.3 11563.6
## BIC 12687.9 11638.3 11596.1
## N 4898 4898 4898
## ====================================================
First of all, we look at the goodness of fit of our models, and it is not good, with R-squared values between 0 and 0.2.
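The table rounds R-squared to one decimal, but more precise values can be recovered from other numbers in this report. For a model with a single predictor, R-squared equals the squared Pearson correlation, so m1 follows from the residual.sugar correlation printed earlier. For m2 and m3 the sketch below assumes that the "Deviance" row is the residual sum of squares, which is consistent with the sigma and F values in the table; all variable names are illustrative.

```r
r_m1 <- -0.09757683   # cor(residual.sugar, quality_lm) reported earlier
df   <- 4896          # residual degrees of freedom of m1

# Single predictor: R-squared is the squared correlation
r2_m1 <- r_m1^2
f_m1  <- r2_m1 * df / (1 - r2_m1)   # F statistic implied by R-squared

# ASSUMPTION: "Deviance" in the table is the residual sum of squares (RSS)
rss <- c(m1 = 3804.4, m2 = 3065.3, m3 = 3033.7)
tss <- rss[["m1"]] / (1 - r2_m1)    # total sum of squares implied by m1
r2  <- 1 - rss / tss

round(f_m1, 1)
round(r2, 3)
```

This gives roughly 0.010, 0.202 and 0.210 for m1, m2 and m3: adding alcohol accounts for almost all of the gain, and density adds less than one percentage point on top.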
If we look at the three models, the best fit (within this low correlation) is model m3, with all 3 factors combined.
I started my multivariate study with the 3 variables that were correlated with each other. The correlation between them was strong, but when looking closer and making the 2D density plots we can’t observe a real separation between the 2 populations (bad and excellent wines). This is confirmed by the analysis made with 3 models, each adding one of the features according to its correlation (Pearson method) with quality: residual.sugar, alcohol, density.
When I do data analysis for problem solving I use this technique: compare the best-of-the-best records against the worst-of-the-worst records. Sadly, the investigation confirms that with this technique there is no strong correlation between these 3 variables and wine quality.
The scatterplots do not show an interaction, and when building the linear models we can observe that this is confirmed by the low R-squared values (although the p-values are close to 0, so the weak effects are statistically significant).
I have created 3 linear models, beginning with just 1 feature as predictor and adding one more predictor in each subsequent model according to its correlation with quality; so the first model includes only residual.sugar, then I add alcohol, and then density as well.
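The incremental model-building procedure can be sketched with `lm()` and `update()`. The example below runs on simulated data (not the wine data), with made-up coefficients, so its numbers mean nothing in themselves; only the column names mirror the model calls shown above. It illustrates the mechanics of adding one predictor at a time and comparing the fits.

```r
# SIMULATED data for illustration only (not the wine dataset)
set.seed(42)
n <- 300
sim <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol        = runif(n, 8, 14)
)
sim$density <- 0.994 + 0.0004 * sim$residual.sugar -
  0.002 * (sim$alcohol - 10.5) + rnorm(n, sd = 0.0003)
sim$quality_lm <- 2 + 0.35 * sim$alcohol + rnorm(n, sd = 0.8)

# Nested models: each one adds a predictor to the previous one
m1 <- lm(quality_lm ~ residual.sugar, data = sim)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + density)

# R-squared can only stay equal or grow as predictors are added
r_squared <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                    function(m) summary(m)$r.squared)
print(round(r_squared, 3))

# F tests for whether each added predictor improves the fit
print(anova(m1, m2, m3))
```

Because R-squared never decreases in nested models, a rise from m2 to m3 is not evidence by itself that density helps; the `anova()` comparison or the adjusted R-squared is the fairer check.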
The R-squared value is low (0.2), so it will be difficult to fit the data to a model. The issue is that the data contains quality evaluations mainly for average-quality wines, which makes it difficult to reach a finding about high-quality wines.
I have selected the following 3 plots.
This plot gives us an overview of the dataset. As we will later discuss the quality of the white wine, we need to identify how large our dataset is and what the range of the ratings is.
This bar chart is the simplest graphic used, but its colour palette lets us easily see that considerably more average wines were evaluated than bad or excellent ones.
For the second plot I have selected the corrplot, which gives us the correlation between variables. The colours let us easily identify positive (blue) and negative (red) correlations. The size of each circle reflects the magnitude of the correlation: the larger the diameter of the circle, the stronger the correlation.
The 2D density plots demonstrate that within the sample there is no separation between bad and excellent wines. This means that the input variables are correlated with each other but unfortunately not with the end result, the output variable quality. We could start an investigation into alcohol, because only a small part of the histograms overlap, and we could assume that high-alcohol white wines have a higher quality if another variable is involved. The candidates to be studied are low density and high alcohol values (they have a negative correlation of -0.78).
After taking a look at our white wine dataset with 4898 records and 14 wine characteristics, I identified 3 that are strongly correlated with each other: residual sugar, alcohol and density.
The correlation between these variables is strong (e.g. residual sugar and density have a positive correlation of 0.84).
On the other hand, I am surprised that the projects I found on GitHub that treat the white wine data all refer to a correlation between (some of) the white wine variables and quality, even though the correlation plot shows only a low correlation with the final output.
Taking into account the correlation between those characteristics, I have made a linear model that does not predict very well, because the R-squared value is just 0.2, but it allows us to confirm the conclusion.
To obtain more reliable results we would need a continuous feature for quality, so that we could better distinguish between a wine rated 8.5 and one rated 9. It would be interesting to make the same calculation on the dataset cleaned of what the boxplots identify as outliers, and to try more complex models instead of linear ones to analyze the relation between the 3 variables and the main output. For further investigation we will need more data for bad and excellent wines in order to build a prediction algorithm (this is one of the main reasons why I did not eliminate outliers).
I struggled to find a correlation between the variables provided and the quality of the wine, and I did not find one, even when creating a linear model. I had issues at the beginning with package installation until I understood that it is better to install packages in the console and afterwards just call library(). It took me a long time to add all the units of measure and titles, and what I found especially time-consuming was splitting every string of text longer than 80 characters. What I wish I had known from the start is the .tabset option: it made the HTML easier to view and definitely better designed.
I consider myself successful in learning how to explore data in R and the different possibilities for visualizing and summarizing the data provided. I discovered software that is to my liking, more centered on the statistical part of exploring data, and with surprisingly aesthetic visualizations. I love the way it publishes the results easily in HTML and also offers free hosting.